Applying Multidimensional Scaling (MDS) to Trader Joe’s Cheese

Multidimensional scaling (MDS) is a method for building a low-dimensional model of a set of objects that preserves as much of the inter-object distance structure as possible. MDS takes as input the pairwise distances between the objects and produces a set of points in a low-dimensional Euclidean space, one point per object, such that the pairwise distances between the points are as close as possible to the corresponding distances between the objects.

The set of objects does not need to have a known high-dimensional model. Moreover, the choice of how distances between objects are defined is up to us; we can choose among many distance metrics. One simple example is Euclidean distance, which is what I use to create my models in the following sections. We then build a symmetric distance matrix, \(D\), in which \(D_{ij}\) is the distance between objects \(i\) and \(j\), so that \(D_{ij} = D_{ji}\) and \(D_{ii} = 0\).
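As a rough illustration of the distance matrix described above, here is a minimal pure-Python sketch. The function name `euclidean_distance_matrix` and the toy vectors are hypothetical and are not taken from the original analysis.

```python
import math

def euclidean_distance_matrix(rows):
    """Return the symmetric matrix D where D[i][j] is the Euclidean
    distance between feature vectors rows[i] and rows[j]."""
    n = len(rows)
    D = [[0.0] * n for _ in range(n)]  # diagonal stays 0: D[i][i] = 0
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(rows[i], rows[j])  # requires Python 3.8+
            D[i][j] = d
            D[j][i] = d  # symmetry: D[i][j] = D[j][i]
    return D

# Toy feature vectors (made-up values, not rows of the cheese dataset)
toy = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = euclidean_distance_matrix(toy)
```

Only the upper triangle is computed and then mirrored, which makes the symmetry and zero-diagonal properties hold by construction.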

I applied MDS to a dataset I created by web scraping nutritional and other information on cheeses from the Trader Joe’s website: https://www.traderjoes.com/home/products/category/cheese-29

Preparing the Data to Model

Cleaning the data was tedious and was done in Python. The process is documented in the Python notebook on the repository page here: https://github.com/lindseygao/mds-tj-cheese. The repository also includes the clean dataset, named clean_df.

After cleaning the data, which involved dropping unnecessary columns and filling in missing data, I ended up with a dataset with the following features:

## Rows: 31
## Columns: 9
## $ price              <dbl> ~
## $ calories           <dbl> ~
## $ total.fat          <dbl> ~
## $ saturated.fat      <dbl> ~
## $ cholesterol        <dbl> ~
## $ sodium             <dbl> ~
## $ total.carbohydrate <dbl> ~
## $ protein            <dbl> ~
## $ calcium            <dbl> ~

There are 31 observations (different cheeses) and 9 features: price plus 8 nutrition values. A preview of the dataset with the first 5 observations/cheeses is shown below:

Preview of Dataset: First 5 Observations
Product | Price | Calories | Total Fat | Saturated Fat | Cholesterol | Sodium | Total Carbohydrate | Protein | Calcium
All Natural Fresh Mozzarella Cheese | 5.980 | 70 | 6 | 3.5 | 20 | 80 | 0 | 5 | 80
Fancy Shredded Mexican Style Cheese Blend | 4.920 | 100 | 8 | 5.0 | 25 | 150 | 3 | 6 | 190
Spicy Buffalo Cheddar | 7.990 | 100 | 8 | 6.0 | 25 | 320 | 0 | 6 | 150
Quattro Formaggi | 6.653 | 110 | 8 | 6.0 | 25 | 230 | 0 | 7 | 250
Garlic Bread Cheese | 11.440 | 90 | 7 | 4.5 | 25 | 200 | 2 | 6 | 200

The units of the columns are as follows:

  • Price: dollars/lb
  • Serving size: g (standardized to be 28g)
  • Calories: calories
  • Total Fat: g
  • Saturated Fat: g
  • Cholesterol: mg
  • Sodium: mg
  • Total Carbohydrate: g
  • Protein: g
  • Calcium: mg

Note that the nutrition information has been calculated/standardized for 1 serving size of 28 grams.

Since the units and scales of columns are different, we need to adjust/normalize the columns of the input data so that they are all “comparable.” This is to ensure that the scale of each dimension of the vectors does not overly affect the distance calculations and, hence, our MDS model.

I normalized the data using min-max normalization so that the results are not skewed by the units of each feature. Min-max normalization computes the minimum and maximum of each column and then maps each entry \(x\) in that column to \((x - \min)/(\max - \min)\). The transformed dataset has a minimum of 0 and a maximum of 1 in every column.
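The min-max mapping can be sketched in a few lines of pure Python. This is only an illustration: the function name `min_max_normalize` is hypothetical, the handling of constant columns is an assumption of this sketch, and the sample uses just the calories and sodium values of the first three cheeses in the preview table, so the minima and maxima here are those of the sample and not of the full dataset.

```python
def min_max_normalize(rows):
    """Rescale each column of a row-major table to the range [0, 1]."""
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        new_row = []
        for x, lo, hi in zip(row, mins, maxs):
            # A constant column has max == min; map it to 0.0 to avoid
            # dividing by zero (an assumption of this sketch).
            new_row.append((x - lo) / (hi - lo) if hi > lo else 0.0)
        normalized.append(new_row)
    return normalized

# Calories and sodium for the first three cheeses in the preview table
sample = [[70, 80], [100, 150], [100, 320]]
scaled = min_max_normalize(sample)
```

After scaling, a 30-calorie gap and a 240 mg sodium gap both span the full range from 0 to 1, so neither column dominates the Euclidean distance simply because of its units.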

Initial Plots

Below is the eigenvalue plot of the model:

We see that the first 4 eigenvalues are pretty large.

The first eigenvalue captures 0.464 of the total energy (proportion of the total eigenvalues) in the data. The first 2 eigenvalues capture 0.7 of the total energy and the first 3 eigenvalues capture 0.796 of the total energy.
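For readers wondering where these eigenvalues come from: the post does not name the MDS variant, but an eigenvalue plot suggests classical (Torgerson) MDS. Under that assumption, and using standard notation that does not appear in the original analysis (\(n\) objects, \(D^{(2)}\) the matrix of squared distances \(D_{ij}^2\), \(J\) the centering matrix), the computation is:

\[
J = I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top, \qquad
B = -\tfrac{1}{2}\, J D^{(2)} J, \qquad
B = V \Lambda V^\top, \quad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n
\]

\[
X_k = V_k \Lambda_k^{1/2}, \qquad
\text{proportion}_k = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{n} \lambda_i}
\]

Here \(X_k\) holds the \(k\)-dimensional coordinates, built from the top \(k\) eigenvectors \(V_k\) and eigenvalues \(\Lambda_k\). With Euclidean input distances, \(B\) is positive semidefinite, so every \(\lambda_i \ge 0\) and the ratio is well defined. Read this way, the cumulative proportions 0.464, 0.700, and 0.796 imply that the second eigenvalue alone contributes about \(0.700 - 0.464 = 0.236\) and the third about \(0.796 - 0.700 = 0.096\).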

One-Dimensional Model

Below is the plot of the one-dimensional model of the data:

We can also plot how the distances produced in the one-dimensional model differ from the original distances:

The line y = x is also plotted (indicating a perfect fit). We observe that the observations are somewhat close to the line but noticeable deviations are visible.

We can evaluate the model in terms of three additional metrics: (1) the goodness of fit (GOF), (2) the mean absolute difference, and (3) the mean squared difference between the model’s distances and the true distances.

  • The mean absolute difference is 0.4055
  • The mean squared difference is 0.2429
  • The GOF value is 0.4643
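The two distance-based metrics can be sketched as follows. This is a hedged illustration, not the code used for the numbers above: the function name `distance_fit_metrics` and the toy matrices are hypothetical, and the sketch assumes the averages are taken over each unordered pair \(i < j\) once. Averaging over the full matrix, diagonal included, would give slightly different values. The GOF is not computed here because, under classical MDS, it is derived from the eigenvalues and not from the distances.

```python
def distance_fit_metrics(original, model):
    """Compare two symmetric distance matrices over each unordered pair i < j.

    Returns (mean absolute difference, mean squared difference).
    """
    n = len(original)
    diffs = [
        model[i][j] - original[i][j]
        for i in range(n)
        for j in range(i + 1, n)
    ]
    mean_abs = sum(abs(d) for d in diffs) / len(diffs)
    mean_sq = sum(d * d for d in diffs) / len(diffs)
    return mean_abs, mean_sq

# Toy distance matrices (made-up values, not the cheese data)
original = [[0, 5, 10], [5, 0, 5], [10, 5, 0]]
model = [[0, 4, 9], [4, 0, 5], [9, 5, 0]]
mean_abs, mean_sq = distance_fit_metrics(original, model)
```

Both metrics are zero only when the model reproduces every pairwise distance exactly, which corresponds to all points lying on the line \(y = x\) in the distance plot.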

Below is an interactive plot of the one-dimensional model of the data, which makes it easy to see the labels of the various cheeses. Note that the color scale is the price of the cheese in dollars/lb.

It’s interesting to note that the vegan cheese options are clustered together and the goat cheese and feta cheese options are also somewhat grouped together. However, the majority of cheeses seem to reside on the left side (many of which are cheddar cheeses).

Two-Dimensional Model

Below is a plot of the two-dimensional model:

Below is the distance plot of the two-dimensional model:

We observe that the distance plot of the two-dimensional model fits the line \(y = x\) much better than that of the one-dimensional model.

The additional evaluation metrics for the two-dimensional model are:

  • Mean absolute difference: 0.2289
  • Mean squared difference: 0.086
  • GOF value: 0.7001

Below is an interactive plot of the two-dimensional model of the data. Again, the color scale is the price of the cheese in dollars/lb.

In the two-dimensional model, we once again see the vegan cheeses tightly clustered together and set apart from the other cheeses, which is to be expected given their different nutritional composition. The ricotta, which is the only ricotta cheese in this dataset, sits alone in the far upper-right corner. The many variations of cheddar cheese are again clustered together towards the left. A fun note: the more unusual “garlic bread cheese” and “pizza bread cheese” also fall in the big cheddar cluster, right next to each other.

Three-Dimensional Model

Below is the distance plot comparing the distances of the three-dimensional model and the true normalized distances:

We observe that the distance plot of the three-dimensional model fits the line \(y = x\) slightly better than that of the two-dimensional model, but the improvement is smaller than the one gained by going from one to two dimensions. Again, I have calculated the additional evaluation metrics below:

  • Mean absolute difference: 0.1536
  • Mean squared difference: 0.0399
  • GOF value: 0.796

Summary of Models

Below is a table summary of the various evaluation metrics for the three different models:

##                   1D Model 2D Model 3D Model
## GOF               0.4643   0.7001   0.796
## Mean Abs Diff     0.4055   0.2289   0.1536
## Mean Squared Diff 0.2429   0.086    0.0399

Below are all three distance plots together for easy comparison:

We see a large improvement when increasing the dimension from one to two, and a smaller improvement when increasing it from two to three. This aligns with what we saw in the individual distance plots.

Investigating Meaning of Dimensions

A question that is often asked is how we should interpret the dimensions of our model. We can first look at how each dimension of the model correlates with each feature. Below is a table of the correlation between the first dimension and each column of our dataset:

Correlation Between Features & First Dimension
Feature Correlation
price -0.286
calories -0.7819
total.fat -0.6227
saturated.fat -0.3174
cholesterol -0.8091
sodium -0.1681
total.carbohydrate 0.4475
protein -0.886
calcium -0.8587

We see strong correlations (all with magnitude above 0.75) for the protein, calcium, cholesterol, and calories columns. We can plot how the first dimension compares to each of these strongly correlated columns:

All four of these columns are negatively correlated with dimension one, suggesting that dimension one may be a combination of these four features.
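Correlations like those in the table above can be computed with a short helper. This sketch assumes the Pearson correlation coefficient, which the post does not state explicitly. The function name `pearson`, the coordinates in `dim1`, and the feature values are all hypothetical and made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical first-dimension coordinates for four cheeses (made-up values)
dim1 = [0.40, 0.10, -0.20, -0.30]
# Hypothetical feature columns for the same four cheeses (made-up values)
features = {
    "protein": [5.0, 6.0, 6.0, 7.0],
    "total.carbohydrate": [3.0, 2.0, 0.0, 0.0],
}
correlations = {name: pearson(dim1, col) for name, col in features.items()}
```

One caveat when reading such a table: the sign of an MDS axis is arbitrary, since flipping every coordinate along an axis leaves all pairwise distances unchanged. The magnitude of a correlation, and whether two features share a sign, is therefore more meaningful than whether a single correlation is positive or negative.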

We can do a similar correlation calculation for the second dimension. The correlation table is shown below:

Correlation Between Features & Second Dimension
Feature Correlation
price -0.1046
calories -0.5701
total.fat -0.7015
saturated.fat -0.9054
cholesterol 0.3167
sodium -0.5901
total.carbohydrate -0.3404
protein 0.3368
calcium 0.1717

We see strong negative correlations for the saturated fat (-0.9054) and total fat (-0.7015) columns. We can now plot these columns against the second dimension:

The second dimension seems to represent the fat level of the cheese, with a lower dimension two value corresponding to a higher fat content.